Skip to content

feat: opt-in telemetry with privacy-strict mode (ported from Node)#5

Merged
marlian merged 11 commits intomainfrom
feat/telemetry
Apr 15, 2026
Merged

feat: opt-in telemetry with privacy-strict mode (ported from Node)#5
marlian merged 11 commits intomainfrom
feat/telemetry

Conversation

@marlian
Copy link
Copy Markdown
Owner

@marlian marlian commented Apr 14, 2026

Summary

Ports the Node telemetry design to Go and adds a new privacy-strict mode. The original condition for this work (Step 3.1's "wait until the Go MCP entrypoint is adopted by a real client") has been met — the binary is in production use by Claude Code, Kilo, and Codex.

Preserved from Node

  • Opt-in via MEMORY_TELEMETRY_PATH — unset = zero overhead, no DB opened.
  • Separate SQLite DB from memory.db — different lifecycle, wipe freely.
  • Counts-only for results; observation content always reduced to <N chars>.
  • Client identity from MCP initialize.clientInfo with env fallback (Kilo, Claude Code, Cursor, Windsurf, VS Code Copilot).
  • Not an MCP tool — infrastructure for the human developer, not a model capability.

Go-native refinements

  1. Nil-tolerant *Client — when disabled, the client pointer is nil and every method returns immediately. Replaces the Node sprinkled if TELEMETRY_ENABLED guards.
  2. Zero globals — constructed in cmd/workmem/main.go, plumbed through mcpserver.Config{Telemetry: …}. Replaces module-level mutable pointers.
  3. SearchMemory returns SearchMetrics as a tuple — replaces the Node _lastSearchMetrics side channel.

New: privacy-strict mode

  • MEMORY_TELEMETRY_PRIVACY=strict sha256-hashes entity names, queries, and event labels before disk.
  • Intended for sensitive backends (e.g., private_memory behind therapy/health/relationship content).
  • Observation content is always stripped regardless of mode.
  • Trade-off documented in docs/TELEMETRY.md.

Commits

  1. feat: add internal/telemetry package — standalone package with unit tests
  2. refactor: SearchMemory returns SearchMetrics for telemetry hook — no behavior change, backward-compatible HandleTool shim
  3. feat: wire telemetry through dispatch with defer-based logging — single defer covers all exit paths; clientInfo via MCP handshake; 3 integration tests
  4. docs: telemetry guide, invariants, Step 3.4, decision entry

Test plan

  • Unit tests: nil-client safety, init failure, strict hashing, sanitize (observation stripping, identifier hashing), client detect (protocol priority + env fallback).
  • Integration test: enabled roundtrip — 3 tool_calls rows with duration > 0, 1 search_metrics linked to recall, no observation content in args_summary.
  • Integration test: disabled zero overhead — no telemetry file created, dispatch works normally.
  • Integration test: privacy-strict — Alice and sensitive query text never land in args_summary; search_metrics.query is sha256-hashed.
  • Full go test ./... green on darwin/arm64.
  • CI matrix (ubuntu / macos / windows) on this branch.

Follow-ups (separate PRs / issues)

  • workmem backup --to <path> --age-recipient <key> subcommand for cross-platform encrypted backup of memory + telemetry DBs. Discussed but deferred — the cloud backup story deserves its own PR.

Copilot AI review requested due to automatic review settings April 14, 2026 20:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Ports the Node telemetry subsystem to the Go implementation, adding an opt-in SQLite telemetry DB plus a privacy-strict mode, and plumbs ranking-pipeline search metrics through recall so the MCP transport can log them.

Changes:

  • Adds internal/telemetry (schema, sanitization, strict hashing, client detection) plus unit tests.
  • Refactors SearchMemory / tool dispatch to return SearchMetrics for telemetry logging.
  • Wires telemetry into the MCP server runtime with integration tests and adds docs/ops/decision-log updates.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
internal/telemetry/schema.go Telemetry SQLite schema + insert statements.
internal/telemetry/client.go Telemetry client lifecycle + logging APIs.
internal/telemetry/sanitize.go Args/result sanitization for safe logging (incl. strict hashing hooks).
internal/telemetry/detect.go Client identity detection (protocol-first, env fallback).
internal/telemetry/*_test.go Unit tests for hashing, sanitization, detection, and client behavior.
internal/store/search.go SearchMemory now returns SearchMetrics alongside results.
internal/store/tools.go Adds HandleToolWithMetrics to surface recall metrics to callers.
internal/store/parity_test.go Updates callsites for the new SearchMemory signature.
internal/mcpserver/server.go Wires telemetry into tool dispatch via a defer-based logger.
internal/mcpserver/telemetry.go Runtime helpers to log tool calls + search metrics.
internal/mcpserver/telemetry_integration_test.go End-to-end tests for enabled/disabled/strict telemetry flows.
cmd/workmem/main.go Constructs telemetry client from env and passes it into the runtime.
docs/TELEMETRY.md Usage, privacy modes, and schema documentation for telemetry.
OPERATIONS.md Adds telemetry invariants / operational expectations.
IMPLEMENTATION.md Marks telemetry step gate as complete and documents acceptance criteria.
DECISION_LOG.md Records the telemetry design decision and strict-mode rationale.

Comment thread internal/mcpserver/telemetry.go Outdated
Comment thread internal/telemetry/client_test.go
Comment thread internal/mcpserver/server.go
Comment thread internal/mcpserver/server.go Outdated
Comment thread docs/TELEMETRY.md Outdated
Comment thread docs/TELEMETRY.md Outdated
Comment thread internal/telemetry/client.go Outdated
Comment thread internal/mcpserver/telemetry_integration_test.go Outdated
Comment thread internal/telemetry/sanitize.go
marlian added a commit that referenced this pull request Apr 14, 2026
Nine comments from the Copilot auto-reviewer, all verified against the
code. Each addressed here:

1. mcpserver/telemetry.go resolveProjectPath: return "" on home-dir
   error instead of the raw project argument, matching the docstring
   and avoiding non-canonical paths leaking into telemetry rows.

2. telemetry/client_test.go TestInitIfEnabledInvalidPathReturnsNil:
   build the invalid path inside t.TempDir() + missing-subdir so the
   failure mode is identical on macOS, Linux, and Windows (the old
   "/nonexistent-…" was Unix-flavored).

3. mcpserver/server.go Runtime: close telemetry in Runtime.Close()
   (nil-safe, idempotent) and document ownership transfer in Config.
   Removed the now-redundant defer tele.Close() from cmd/workmem/main.go
   to avoid double-close semantics.

4. mcpserver/server.go handleTool: gate the time.Now() / defer block
   behind `if r.telemetry != nil` so the disabled path is literally a
   single pointer-nil check per call, matching the "zero overhead"
   promise.

5. docs/TELEMETRY.md: correct the lifecycle statement — the telemetry
   DB is opened at process startup (via FromEnv), not on first tool
   call.

6. docs/TELEMETRY.md: reword the disabled-path description to reflect
   the fast-path change from fix 4 (no timing, no logging, no DB
   opened; one pointer-nil check on the hot path).

7. telemetry/client.go InitIfEnabled: align pragmas with the main
   memory DB — enable foreign_keys so search_metrics -> tool_calls FK
   is enforced, set busy_timeout=5000 so brief lock contention retries
   instead of erroring, and SetMaxOpenConns(1) for deterministic write
   ordering.

8. mcpserver/telemetry_integration_test.go
   TestTelemetryDisabledViaFromEnvCreatesNoArtifacts: replaces the
   previous test that only asserted an arbitrary path did not exist
   (a vacuous claim). The new test drives the actual env -> FromEnv ->
   runtime path with MEMORY_TELEMETRY_PATH unset, and asserts that no
   telemetry artifact appears anywhere in the data dir after a full
   remember/recall cycle.

9. telemetry/sanitize.go SanitizeArgs: count runes instead of bytes
   when rendering "<N chars>" for redacted string fields, so the
   marker is accurate for non-ASCII content.
@marlian marlian requested a review from Copilot April 14, 2026 21:07
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.

Comment thread internal/telemetry/client.go Outdated
Comment thread internal/mcpserver/telemetry_integration_test.go Outdated
Comment thread internal/mcpserver/telemetry_integration_test.go Outdated
marlian added a commit that referenced this pull request Apr 14, 2026
Three new comments after the previous fixes landed. All legitimate:

1. telemetry/client.go InitIfEnabled: align DSN pattern with the main
   memory DB. Use file:<cleanPath>?_pragma=foreign_keys(1) + db.Ping()
   instead of a raw path + separate PRAGMA. Safer on Windows drive
   letters, foreign keys are enforced from open time (not just at the
   first Exec), and the init failure mode becomes immediately
   observable via Ping.

2. telemetry_integration_test.go: the enabled-roundtrip and strict-mode
   tests were calling tele.Close() directly, violating the ownership
   contract the previous round introduced (Runtime owns the telemetry
   client once passed to New). Tests now call stop() explicitly before
   reopening the telemetry DB for readback, so Runtime.Close -> tele
   .Close -> flush happens through the documented path.

3. startTelemetrySession stop function made idempotent via sync.Once,
   so explicit stop() + deferred stop() converge safely. t.Fatal
   replaced with t.Error in the cleanup timeout path — calling Fatal
   from a deferred function aborts the goroutine but skips subsequent
   cleanup/readback assertions; Error is the correct level here.

The self-inflicted inconsistency between "Runtime owns the telemetry
client" (added last round in mcpserver.Config docstring) and the tests
closing it directly is exactly the kind of drift a second reviewer pass
catches. Good catch.
@marlian marlian requested a review from Copilot April 14, 2026 21:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 8 comments.

Comment thread internal/mcpserver/telemetry_integration_test.go Outdated
Comment thread internal/telemetry/client.go
Comment thread internal/telemetry/client_test.go Outdated
Comment thread internal/mcpserver/telemetry_integration_test.go Outdated
Comment thread internal/telemetry/detect_test.go Outdated
Comment thread internal/telemetry/detect_test.go Outdated
Comment thread internal/mcpserver/telemetry.go
Comment thread internal/telemetry/client_test.go Outdated
marlian added a commit that referenced this pull request Apr 14, 2026
Eight new comments, grouped into three themes:

Cross-platform readback DSN (4 sites): client_test.go (two tests) and
telemetry_integration_test.go (two tests) were opening the telemetry
DB for readback with a raw path, while production code uses
file:<cleanPath>?_pragma=foreign_keys(1) (see internal/store/sqlite.go
openSQLite). The raw-path form can misbehave on Windows drive letters
and drifts from production. All four readbacks now use the same DSN
construction.

Belt-and-suspenders FK enforcement (1 site): telemetry/client.go
InitIfEnabled sets _pragma=foreign_keys(1) in the DSN but did not
issue PRAGMA foreign_keys=ON explicitly afterward. The main store's
openSQLite does both, because some driver/version combinations honor
one form but not the other. Telemetry now matches the same sequence.

Small drift cleanup (3 sites):
- detect_test.go TestDetectClientEnvFallbackKilo: remove the redundant
  t.Setenv pair before unsetMCPClientEnv — same two vars were set,
  cleared, and re-set, which adds noise without changing behavior.
- detect_test.go unsetMCPClientEnv: rewrite the trailing comment. The
  old wording warned about LookupEnv semantics, but DetectClient
  actually uses os.Getenv != "", so empty-string values are treated as
  absent for every current signal.
- mcpserver/telemetry.go resolveProjectPath: the docstring promised a
  "global-scope call without a path" on resolve failure, but
  logToolCall keeps db_scope="project" whenever the caller supplied a
  non-empty project argument, regardless of whether resolution
  succeeded. Rewrote the docstring to match the actual (and intended)
  behavior: scope reflects caller intent, only project_path column is
  NULL when resolution fails.
@marlian marlian requested a review from Copilot April 14, 2026 21:29
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Comment thread cmd/workmem/main.go Outdated
Comment thread internal/telemetry/client.go Outdated
Comment thread internal/telemetry/client.go
Comment thread internal/telemetry/sanitize.go
marlian added a commit that referenced this pull request Apr 14, 2026
Four new comments, all legitimate:

1. cmd/workmem/main.go: fix telemetry leak when mcpserver.New fails.
   FromEnv() was inlined directly in the Config literal — if New
   returned an error, the DB handle was already open and nobody closed
   it. Now FromEnv() lands in a local var; on New failure the code
   calls tele.Close() (nil-safe) before exiting. Ownership transfers
   to Runtime only after New succeeds, matching the documented
   contract.

2. internal/telemetry/client.go: Client.Close is now truly idempotent.
   After the first call it nil-s db, insertCall, insertSearch so the
   second call hits the new `if c.db == nil { return nil }` guard
   instead of trying to close an already-closed *sql.DB (which would
   have surfaced a confusing shutdown error). Runtime.Close already
   comments "nil-safe and idempotent" — now it's true.

3. internal/telemetry/{schema.go,client.go}: schema initialization no
   longer runs a single multi-statement Exec. Statements are listed
   separately in schemaStatements and exec'd one-by-one, matching the
   main store's InitSchema pattern. More portable across SQLite
   drivers and produces per-statement errors when one fails.

4. internal/telemetry/sanitize.go: add optional Summarizable interface
   as a fast path in SummarizeResult. Types that implement
   TelemetrySummary() skip the JSON marshal + unmarshal round-trip
   entirely. The fallback stays in place for all current store result
   types (zero behavior change this round); any type whose result size
   becomes a telemetry hotspot can now opt in by adding the method.

Tests:
- TestClientCloseIsIdempotent: open, close, close, close — all three
  must return nil, proving Close can be called defensively multiple
  times without surfacing an error.
- TestSummarizeResultUsesFastPathForSummarizable: a fake Summarizable
  type is serialized without going through the JSON fallback.
- TestSummarizeResultFallsBackToJSONForNonSummarizable: plain map
  values still produce the correct entity_groups / stored summary
  through the fallback path.
@marlian marlian requested a review from Copilot April 14, 2026 21:41
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Comment thread internal/telemetry/client.go
Comment thread internal/telemetry/client.go
Comment thread internal/store/search.go
Comment thread docs/TELEMETRY.md Outdated
marlian added a commit that referenced this pull request Apr 14, 2026
- telemetry: guard LogToolCall and LogSearchMetrics against post-Close
  panic by nil-checking db / prepared-statement fields, not just the
  receiver. Covers late-arriving tool calls during an orderly shutdown.
  Test: TestLogAfterCloseDoesNotPanic.
- search: record the trimmed query in SearchMetrics so two recall calls
  that only differ by whitespace do not appear as distinct queries in
  telemetry. Test: TestSearchMemoryMetricsQueryIsTrimmed (5 subcases).
- docs: TELEMETRY.md said batched arrays became "<N items>" but the code
  emits "<N facts>" / "<N observations>". Doc updated to match reality.
@marlian marlian requested a review from Copilot April 14, 2026 21:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

Comment thread internal/telemetry/client.go
Comment thread internal/mcpserver/server.go Outdated
marlian added a commit that referenced this pull request Apr 14, 2026
Two concurrency concerns raised on c46ac66:

1. telemetry.Client had no synchronization: Close nil-ed db/stmt fields
   while LogToolCall / LogSearchMetrics read and used the same pointers,
   so a real concurrent shutdown could produce a data race (formally
   undefined behavior under Go's memory model) or a TOCTOU panic where
   Exec ran on a stmt being closed.

   Fix: add sync.Mutex to Client, protect Close + Log* under the lock.
   strict is immutable after construction and stays lock-free. The
   uncontended lock cost (a few ns) is dwarfed by SQLite Exec (µs-ms),
   so the hot path is unaffected.

2. Runtime.Close called r.telemetry.Close() before nil-ing r.telemetry,
   so an in-flight handler's telemetry defer could observe a non-nil
   pointer on a client that was concurrently closing.

   Fix: swap r.telemetry into a local, set the field to nil first, then
   Close the local. With the client mutex this is a defensive second
   layer rather than the primary safety guarantee, but it keeps future
   readers honest about lifecycle ordering.

Tests:
- TestClientCloseRacesWithLogging fires 32 logger goroutines firing
  LogToolCall + LogSearchMetrics in a tight loop while Close runs
  concurrently, then a final round of post-close calls that must
  no-op. Passes under `go test -race` — without the mutex the race
  detector flags the pointer-field accesses immediately.
@marlian marlian requested a review from Copilot April 14, 2026 23:32
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.

Comment thread internal/store/search.go
Comment thread internal/mcpserver/server.go Outdated
Comment thread internal/mcpserver/server.go Outdated
Comment thread internal/telemetry/client.go
Comment thread internal/telemetry/client.go
marlian added 4 commits April 15, 2026 01:42
Port of the Node telemetry design to Go with three refinements:

- Nil-tolerant *Client: InitIfEnabled returns nil when disabled, every
  method no-ops on nil receiver. Replaces per-callsite guards.
- No globals: client is plumbed via constructor, never module-level.
- Privacy-strict mode: new MEMORY_TELEMETRY_PRIVACY=strict option hashes
  entity/query/label values with sha256 before storage. Observation
  content is always reduced to <N chars> regardless of mode.

Package contents:
- schema.go   — DDL for tool_calls + search_metrics + indexes
- client.go   — Client, InitIfEnabled, FromEnv, LogToolCall, LogSearchMetrics
- sanitize.go — args/result summarizers, HTML escape disabled for readable sentinels
- hash.go     — sha256 helpers for strict mode
- detect.go   — client identity (MCP handshake primary, env fallback)

All ported env signals from Node preserved: KILO, CLAUDE_CODE_SSE_PORT,
CURSOR_TRACE_ID, WINDSURF_EXTENSION_ID, VSCODE_MCP_HTTP_PREFER,
TERM_PROGRAM. Added non-empty guard on VSCODE_MCP_HTTP_PREFER vs the Node
presence-only check — simpler semantics, same real-world detection.

Uses modernc.org/sqlite for the telemetry DB to preserve the pure-Go
single-binary invariant (no CGO added for analytics).
SearchMemory now returns (results, metrics, err). The metrics struct
captures channel counts, candidates_total, results_returned, score
distribution, and limit_requested — exactly what the search_metrics
telemetry table needs, with no coupling to the telemetry package.

Replaces the Node pattern of a module-level _lastSearchMetrics side
channel. Go callers get an explicit value and tests can assert on it
without any global state.

HandleTool keeps its original (any, error) signature for backward
compatibility. A new HandleToolWithMetrics sibling returns
(any, *SearchMetrics, error) and is called by the mcpserver dispatch
when it wants the search_metrics payload. The shared body lives in an
inner dispatchTool helper that writes metrics via an out-pointer —
avoids rewriting every case's return statement.

Callers updated: internal/store/tools.go recall handler,
internal/store/parity_test.go (5 call sites). No behavior change for
any existing test.
Entrypoint (cmd/workmem/main.go): reads MEMORY_TELEMETRY_PATH and
MEMORY_TELEMETRY_PRIVACY via telemetry.FromEnv(), passes the *Client to
mcpserver.Config. Deferred Close() on shutdown. When unset, FromEnv
returns nil and no DB is ever opened.

mcpserver dispatch (internal/mcpserver/server.go): handleTool now
measures wall-clock duration and uses a single defer to log on every
exit path (success, validation failure, dispatch error). The defer
captures four observables as the flow progresses — argObject,
toolResult, metrics, projectRaw, isError — and flushes them via
r.logToolCall + r.logSearchMetrics at the end.

mcpserver helpers (internal/mcpserver/telemetry.go): client detection
from MCP initialize handshake (req.Session.InitializeParams().ClientInfo)
with env fallback via telemetry.DetectClient. Resolves project path to
absolute when project arg is present.

Integration tests (3): enabled roundtrip asserts remember/recall/forget
each land a tool_calls row with duration > 0 and no observation content
leak, plus one search_metrics row linked to the recall tool_call.
Disabled path asserts no telemetry file is created. Strict mode asserts
Alice and sensitive-query text never appear in args_summary and
search_metrics.query is sha256-hashed.
- docs/TELEMETRY.md: full user-facing guide. Adapted from the Node
  telemetry.md, with privacy-strict mode documented (trade-off: ranking
  debug vs plaintext leak risk on sensitive backends). Includes example
  client config for Claude Code / governor env-file wiring.

- OPERATIONS.md invariants: telemetry is opt-in and never affects the
  success path; DB is physically separate from memory DB; strict mode
  sha256-hashes identifiers before disk. P1 "telemetry deferred" entry
  removed (condition met). P2 telemetry-schema entry removed (designed
  and implemented). Pre-Launch TODO no longer lists Kilo proof — already
  closed in PR #4.

- IMPLEMENTATION.md: new Step 3.4 Telemetry marked done with explicit
  Gate — enabled writes rows, disabled creates no DB, strict hashes.

- DECISION_LOG.md: append entry documenting the three Go-native
  refinements (nil-tolerant Client, no globals, SearchMetrics tuple)
  and the new privacy-strict mode. Explicitly rejects at-rest
  encryption with keychain for this iteration as over-scope.
marlian added 6 commits April 15, 2026 01:48
Nine comments from the Copilot auto-reviewer, all verified against the
code. Each addressed here:

1. mcpserver/telemetry.go resolveProjectPath: return "" on home-dir
   error instead of the raw project argument, matching the docstring
   and avoiding non-canonical paths leaking into telemetry rows.

2. telemetry/client_test.go TestInitIfEnabledInvalidPathReturnsNil:
   build the invalid path inside t.TempDir() + missing-subdir so the
   failure mode is identical on macOS, Linux, and Windows (the old
   "/nonexistent-…" was Unix-flavored).

3. mcpserver/server.go Runtime: close telemetry in Runtime.Close()
   (nil-safe, idempotent) and document ownership transfer in Config.
   Removed the now-redundant defer tele.Close() from cmd/workmem/main.go
   to avoid double-close semantics.

4. mcpserver/server.go handleTool: gate the time.Now() / defer block
   behind `if r.telemetry != nil` so the disabled path is literally a
   single pointer-nil check per call, matching the "zero overhead"
   promise.

5. docs/TELEMETRY.md: correct the lifecycle statement — the telemetry
   DB is opened at process startup (via FromEnv), not on first tool
   call.

6. docs/TELEMETRY.md: reword the disabled-path description to reflect
   the fast-path change from fix 4 (no timing, no logging, no DB
   opened; one pointer-nil check on the hot path).

7. telemetry/client.go InitIfEnabled: align pragmas with the main
   memory DB — enable foreign_keys so search_metrics -> tool_calls FK
   is enforced, set busy_timeout=5000 so brief lock contention retries
   instead of erroring, and SetMaxOpenConns(1) for deterministic write
   ordering.

8. mcpserver/telemetry_integration_test.go
   TestTelemetryDisabledViaFromEnvCreatesNoArtifacts: replaces the
   previous test that only asserted an arbitrary path did not exist
   (a vacuous claim). The new test drives the actual env -> FromEnv ->
   runtime path with MEMORY_TELEMETRY_PATH unset, and asserts that no
   telemetry artifact appears anywhere in the data dir after a full
   remember/recall cycle.

9. telemetry/sanitize.go SanitizeArgs: count runes instead of bytes
   when rendering "<N chars>" for redacted string fields, so the
   marker is accurate for non-ASCII content.
Three new comments after the previous fixes landed. All legitimate:

1. telemetry/client.go InitIfEnabled: align DSN pattern with the main
   memory DB. Use file:<cleanPath>?_pragma=foreign_keys(1) + db.Ping()
   instead of a raw path + separate PRAGMA. Safer on Windows drive
   letters, foreign keys are enforced from open time (not just at the
   first Exec), and the init failure mode becomes immediately
   observable via Ping.

2. telemetry_integration_test.go: the enabled-roundtrip and strict-mode
   tests were calling tele.Close() directly, violating the ownership
   contract the previous round introduced (Runtime owns the telemetry
   client once passed to New). Tests now call stop() explicitly before
   reopening the telemetry DB for readback, so Runtime.Close -> tele
   .Close -> flush happens through the documented path.

3. startTelemetrySession stop function made idempotent via sync.Once,
   so explicit stop() + deferred stop() converge safely. t.Fatal
   replaced with t.Error in the cleanup timeout path — calling Fatal
   from a deferred function aborts the goroutine but skips subsequent
   cleanup/readback assertions; Error is the correct level here.

The self-inflicted inconsistency between "Runtime owns the telemetry
client" (added last round in mcpserver.Config docstring) and the tests
closing it directly is exactly the kind of drift a second reviewer pass
catches. Good catch.
Eight new comments, grouped into three themes:

Cross-platform readback DSN (4 sites): client_test.go (two tests) and
telemetry_integration_test.go (two tests) were opening the telemetry
DB for readback with a raw path, while production code uses
file:<cleanPath>?_pragma=foreign_keys(1) (see internal/store/sqlite.go
openSQLite). The raw-path form can misbehave on Windows drive letters
and drifts from production. All four readbacks now use the same DSN
construction.

Belt-and-suspenders FK enforcement (1 site): telemetry/client.go
InitIfEnabled sets _pragma=foreign_keys(1) in the DSN but did not
issue PRAGMA foreign_keys=ON explicitly afterward. The main store's
openSQLite does both, because some driver/version combinations honor
one form but not the other. Telemetry now matches the same sequence.

Small drift cleanup (3 sites):
- detect_test.go TestDetectClientEnvFallbackKilo: remove the redundant
  t.Setenv pair before unsetMCPClientEnv — same two vars were set,
  cleared, and re-set, which adds noise without changing behavior.
- detect_test.go unsetMCPClientEnv: rewrite the trailing comment. The
  old wording warned about LookupEnv semantics, but DetectClient
  actually uses os.Getenv != "", so empty-string values are treated as
  absent for every current signal.
- mcpserver/telemetry.go resolveProjectPath: the docstring promised a
  "global-scope call without a path" on resolve failure, but
  logToolCall keeps db_scope="project" whenever the caller supplied a
  non-empty project argument, regardless of whether resolution
  succeeded. Rewrote the docstring to match the actual (and intended)
  behavior: scope reflects caller intent, only project_path column is
  NULL when resolution fails.
Four new comments, all legitimate:

1. cmd/workmem/main.go: fix telemetry leak when mcpserver.New fails.
   FromEnv() was inlined directly in the Config literal — if New
   returned an error, the DB handle was already open and nobody closed
   it. Now FromEnv() lands in a local var; on New failure the code
   calls tele.Close() (nil-safe) before exiting. Ownership transfers
   to Runtime only after New succeeds, matching the documented
   contract.

2. internal/telemetry/client.go: Client.Close is now truly idempotent.
   After the first call it nil-s db, insertCall, insertSearch so the
   second call hits the new `if c.db == nil { return nil }` guard
   instead of trying to close an already-closed *sql.DB (which would
   have surfaced a confusing shutdown error). Runtime.Close already
   comments "nil-safe and idempotent" — now it's true.

3. internal/telemetry/{schema.go,client.go}: schema initialization no
   longer runs a single multi-statement Exec. Statements are listed
   separately in schemaStatements and exec'd one-by-one, matching the
   main store's InitSchema pattern. More portable across SQLite
   drivers and produces per-statement errors when one fails.

4. internal/telemetry/sanitize.go: add optional Summarizable interface
   as a fast path in SummarizeResult. Types that implement
   TelemetrySummary() skip the JSON marshal + unmarshal round-trip
   entirely. The fallback stays in place for all current store result
   types (zero behavior change this round); any type whose result size
   becomes a telemetry hotspot can now opt in by adding the method.

Tests:
- TestClientCloseIsIdempotent: open, close, close, close — all three
  must return nil, proving Close can be called defensively multiple
  times without surfacing an error.
- TestSummarizeResultUsesFastPathForSummarizable: a fake Summarizable
  type is serialized without going through the JSON fallback.
- TestSummarizeResultFallsBackToJSONForNonSummarizable: plain map
  values still produce the correct entity_groups / stored summary
  through the fallback path.
- telemetry: guard LogToolCall and LogSearchMetrics against post-Close
  panic by nil-checking db / prepared-statement fields, not just the
  receiver. Covers late-arriving tool calls during an orderly shutdown.
  Test: TestLogAfterCloseDoesNotPanic.
- search: record the trimmed query in SearchMetrics so two recall calls
  that only differ by whitespace do not appear as distinct queries in
  telemetry. Test: TestSearchMemoryMetricsQueryIsTrimmed (5 subcases).
- docs: TELEMETRY.md said batched arrays became "<N items>" but the code
  emits "<N facts>" / "<N observations>". Doc updated to match reality.
Two concurrency concerns raised on c46ac66:

1. telemetry.Client had no synchronization: Close nil-ed db/stmt fields
   while LogToolCall / LogSearchMetrics read and used the same pointers,
   so a real concurrent shutdown could produce a data race (formally
   undefined behavior under Go's memory model) or a TOCTOU panic where
   Exec ran on a stmt being closed.

   Fix: add sync.Mutex to Client, protect Close + Log* under the lock.
   strict is immutable after construction and stays lock-free. The
   uncontended lock cost (a few ns) is dwarfed by SQLite Exec (µs-ms),
   so the hot path is unaffected.

2. Runtime.Close called r.telemetry.Close() before nil-ing r.telemetry,
   so an in-flight handler's telemetry defer could observe a non-nil
   pointer on a client that was concurrently closing.

   Fix: swap r.telemetry into a local, set the field to nil first, then
   Close the local. With the client mutex this is a defensive second
   layer rather than the primary safety guarantee, but it keeps future
   readers honest about lifecycle ordering.

Tests:
- TestClientCloseRacesWithLogging fires 32 logger goroutines firing
  LogToolCall + LogSearchMetrics in a tight loop while Close runs
  concurrently, then a final round of post-close calls that must
  no-op. Passes under `go test -race` — without the mutex the race
  detector flags the pointer-field accesses immediately.
Two concurrency concerns and two stderr-spam concerns:

1. Runtime.telemetry was a plain *Client pointer written during
   Close() and read by in-flight handleTool goroutines — a data race
   under -race, regardless of the Client's internal mutex.

   Fix: make it an atomic.Pointer[telemetry.Client]. handleTool calls
   Load() once and captures the pointer into its defer closure, so the
   deferred Log* uses the captured value instead of re-reading the field.
   Runtime.Close uses Swap(nil) to atomically take ownership for cleanup.
   logToolCall / logSearchMetrics now accept an explicit *telemetry.Client
   parameter and never touch the atomic field themselves.

2. LogToolCall / LogSearchMetrics printed "telemetry log failed: ..." to
   stderr on every Exec error. A persistently wedged telemetry DB (disk
   full, permission change, corruption) would therefore spam stderr once
   per tool call for the remainder of the session.

   Fix: add a `degraded` flag to Client. On the first Exec failure we
   emit a single warning ("further errors suppressed for this session")
   and set degraded=true; subsequent Log* calls take the degraded branch
   under the existing mutex and return silently. Recovery requires a
   restart — same contract as init failure.

Tests:
- TestClientDegradedModeSuppressesSpam captures stderr via os.Pipe,
  force-closes the underlying sql.DB to make Exec fail, fires three
  Log* calls, and asserts exactly one warning plus degraded=true.
- All existing tests pass under `go test -race`, including the
  concurrency regression test.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated no new comments.

@marlian marlian merged commit 305d13f into main Apr 15, 2026
12 checks passed
@marlian marlian deleted the feat/telemetry branch April 15, 2026 00:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants